To iterate on an application, we need a way to evaluate whether it's improving. A common practice is to test it against the same set of examples whenever something changes. Weave has a first-class way to track evaluations with Model & Evaluation classes. We have built the APIs to make minimal assumptions, allowing the flexibility to support a wide array of use cases.

1. Build a Model

Models store and version information about your system, such as prompts, temperatures, and more. Weave automatically captures when they are used and updates the version when there are changes. Models are declared by subclassing Model and implementing a predict function, which takes one example and returns the response.
import json
import openai
import weave

class ExtractFruitsModel(weave.Model):
    # Attributes are versioned: changing either field produces a new model version.
    model_name: str
    prompt_template: str

    @weave.op()
    async def predict(self, sentence: str) -> dict:
        client = openai.AsyncClient()

        response = await client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "user", "content": self.prompt_template.format(sentence=sentence)}
            ],
        )
        result = response.choices[0].message.content
        if result is None:
            raise ValueError("No response from model")
        # The prompt asks for JSON, so parse the completion into a dict.
        parsed = json.loads(result)
        return parsed
You can instantiate Model objects as normal:
import asyncio
import weave

weave.init('intro-example')

model = ExtractFruitsModel(
    model_name='gpt-3.5-turbo-1106',
    prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: {sentence}'
)
sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy."
print(asyncio.run(model.predict(sentence)))
# if you're in a Jupyter Notebook, run:
# await model.predict(sentence)
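If the model follows the prompt, this should print a parsed dict along the lines of {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'}, though the exact wording can vary from run to run.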
Check out the Models guide to learn more.

2. Collect some examples

Next, you need a dataset to evaluate your model on. A Dataset is just a collection of examples stored as a Weave object. You’ll be able to download, browse and run evaluations on datasets in the Weave UI. Here we build a list of examples in code, but you can also log them one at a time from your running application.
sentences = [
    "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
    "Pounits are a bright green color and are more savory than sweet.",
    "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."
]
labels = [
    {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'},
    {'fruit': 'pounits', 'color': 'bright green', 'flavor': 'savory'},
    {'fruit': 'glowls', 'color': 'pale orange', 'flavor': 'sour and bitter'}
]
examples = [
    {'id': '0', 'sentence': sentences[0], 'target': labels[0]},
    {'id': '1', 'sentence': sentences[1], 'target': labels[1]},
    {'id': '2', 'sentence': sentences[2], 'target': labels[2]}
]
Then publish your dataset:
import weave

weave.init('intro-example')
dataset = weave.Dataset(name='fruits', rows=examples)
weave.publish(dataset)
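Publishing gives the dataset a versioned name you can fetch in later scripts, so every evaluation runs against the same rows. A minimal sketch, assuming Weave's ref API resolves object names within the current project:
import weave

weave.init('intro-example')
# Fetch the latest published version of the 'fruits' dataset by name.
dataset = weave.ref('fruits').get()
print(dataset.rows[0])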
Check out the Datasets guide to learn more.

3. Define scoring functions

Evaluations assess a Model's performance on a set of examples using a list of specified scoring functions or weave.scorers.Scorer classes.
import weave
from weave.scorers import MultiTaskBinaryClassificationF1

@weave.op()
def fruit_name_score(target: dict, output: dict) -> dict:
    # `output` receives the model's prediction; other arguments (here `target`)
    # are matched by name to columns of the dataset row being scored.
    return {'correct': target['fruit'] == output['fruit']}
To make your own scoring function, learn more in the Scorers guide. In some applications we want to create custom Scorer classes, where, for example, a standardized LLMJudge class should be created with specific parameters (e.g. chat model, prompt), specific scoring of each row, and a specific calculation of the aggregate score; a small sketch follows below. See the tutorial on defining a Scorer class in the next chapter on Model-Based Evaluation of RAG applications for more information.
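For illustration only (not the LLMJudge from the next chapter), here is a hedged sketch of a custom Scorer. FieldCoverageScorer is a hypothetical name, and the sketch assumes weave.Scorer as the base class, a per-row score method, and an optional summarize override for aggregation:
from typing import Optional

import weave

class FieldCoverageScorer(weave.Scorer):
    # Hypothetical scorer: checks that the model returned every expected field.
    required_fields: list[str]

    @weave.op()
    def score(self, output: dict) -> dict:
        # Called once per example; `output` is the model's prediction for that row.
        return {'all_fields_present': all(f in output for f in self.required_fields)}

    def summarize(self, score_rows: list) -> Optional[dict]:
        # Aggregate per-row results into one fraction across the dataset.
        if not score_rows:
            return None
        covered = sum(1 for row in score_rows if row['all_fields_present'])
        return {'fraction_covered': covered / len(score_rows)}
An instance like FieldCoverageScorer(required_fields=['fruit', 'color', 'flavor']) could then be passed in the scorers list alongside plain scoring functions.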

4. Run the evaluation

Now, you’re ready to run an evaluation of ExtractFruitsModel on the fruits dataset using your scoring functions.
import asyncio
import weave
from weave.scorers import MultiTaskBinaryClassificationF1

weave.init('intro-example')

evaluation = weave.Evaluation(
    name='fruit_eval',
    dataset=dataset, 
    scorers=[
        MultiTaskBinaryClassificationF1(class_names=["fruit", "color", "flavor"]), 
        fruit_name_score
    ],
)
print(asyncio.run(evaluation.evaluate(model)))
# if you're in a Jupyter Notebook, run:
# await evaluation.evaluate(model)
If you’re running from a Python script, you’ll need to use asyncio.run. However, if you’re running from a Jupyter notebook, you can use await directly.
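Because the Evaluation is defined independently of any single model, you can reuse it to compare variants against the same dataset and scorers. A quick sketch, where the second model name is a hypothetical stand-in:
import asyncio

candidate = ExtractFruitsModel(
    model_name='gpt-4o-mini',  # hypothetical alternative; swap in any model to test
    prompt_template=model.prompt_template,
)
print(asyncio.run(evaluation.evaluate(candidate)))
Running the same Evaluation against multiple models logs results you can compare side by side in the Weave UI.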

5. View your evaluation results

Weave will automatically capture traces of each prediction and score. Click on the link printed by the evaluation to view the results in the Weave UI.

What’s next?

Learn how to:
  1. Compare model performance: Try different models and compare their results
  2. Explore Built-in Scorers: Check out Weave’s built-in scoring functions in our Scorers guide
  3. Build a RAG app: Follow our RAG tutorial to learn about evaluating retrieval-augmented generation
  4. Advanced evaluation patterns: Learn about Model-Based Evaluation for using LLMs as judges